{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 3: Basic Application Development\n",
    "\n",
    "This notebook introduces the essential usage of `bigdl-llm`, and walks you through building a very basic chat application.\n",
    "\n",
    "## 3.1 Install `bigdl-llm`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you haven't installed `bigdl-llm`, install it as shown below. The one-line command will install the latest `bigdl-llm` with all the dependencies for common LLM application development."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --pre --upgrade bigdl-llm[all]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Load a pretrained Model\n",
    "\n",
    "Before using a LLM, you need to first load one. Here we take a relatively small LLM, i.e. [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2) as an example.\n",
    "\n",
    "> **Note**\n",
    ">\n",
    "> * `open_llama_3b_v2` is an open-source large language model based on the LLaMA architecture. You can find more information about this model on its [homepage](https://huggingface.co/openlm-research/open_llama_3b_v2) hosted on Hugging Face.\n",
    "\n",
    "### 3.2.1 Load and Optimize Model\n",
    " \n",
    "In general, you just need one-line `optimize_model` to easily optimize any loaded PyTorch model, regardless of the library or API you are using. For more detailed usage of optimize_model, please refer to the [API documentation](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html#).\n",
    "\n",
    "Besides, many popular open-source PyTorch large language models can be loaded using the `Huggingface Transformers API` (such as [AutoModel](https://huggingface.co/docs/transformers/v4.33.2/en/model_doc/auto#transformers.AutoModel), [AutoModelForCasualLM](https://huggingface.co/docs/transformers/v4.33.2/en/model_doc/auto#transformers.AutoModelForCausalLM), etc.). For such models, bigdl-llm also provides a set of APIs to support them. We will now demonstrate how to use them.\n",
    "\n",
    "In this example, we use `bigdl.llm.transformers.AutoModelForCausalLM` to load the `open_llama_3b_v2 model`. This API mirrors the official `transformers.AutoModelForCasualLM` with only a few additional parameters and methods related to low-bit optimization in the loading process.\n",
    "\n",
    "To enable INT4 optimization, simply set `load_in_4bit=True` in `from_pretrained`. Additionally, we configure the parameters `torch_dtype=\"auto\"` and `low_cpu_mem_usage=True` by default, as they may improve both performance and memory efficiency. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.llm.transformers import AutoModelForCausalLM\n",
    "\n",
    "model_path = 'openlm-research/open_llama_3b_v2'\n",
    "\n",
    "model = AutoModelForCausalLM.from_pretrained(model_path,\n",
    "                                             load_in_4bit=True)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    ">\n",
    "> * If you want to use precisions other than INT4(e.g. NF4/INT5/INT8,etc.), or know more details about the arguments, please refer to [API document](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html#) for more information. \n",
    ">\n",
    "> * `openlm-research/open_llama_3b_v2` is the **_model_id_** of the model `open_llama_3b_v2` on huggingface. When you set the `model_path` parameter of `from_pretrained` to this **_model_id_**, `from_pretrained` will automatically download the model from huggingface,  cache it locally (e.g. `~/.cache/huggingface`), and load it. It may take a long time to download the model using this API. Alternatively, you can download the model yourself, and set `model_path` to the local path of the downloaded model. For more information, refer to the [`from_pretrained` document](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained).\n",
    "\n",
    "\n",
    "### 3.2.2 Save & Load Optimized Model\n",
    "\n",
    "In the previous section, models loaded using the `Huggingface Transformers API` are typically stored with either fp32 or fp16 precision. To save model space and speedup loading processes, bigdl-llm also provides the `save_low_bit` API for saving the model after low-bit optimization, and the `load_low_bit` API for loading the saved low-bit model.\n",
    "\n",
    "You can use `save_low_bit` once and use `load_low_bit` many times for inference. This approach bypasses the processes of loading the original FP32/FP16 model and optimization during inference stage, saving both memory and time. Moreover, because the optimized model format is platform-agnostic, you can seamlessly perform saving and loading operations across various machines, regardless of their operating systems. This flexibility enables you to perform optimization/saving on a high-RAM server and deploy the model for inference on a PC with limited RAM.\n",
    "\n",
    "\n",
    "**Save Optimized Model**\n",
    "\n",
    "For example, you can use the `save_low_bit` function to save the optimized model as below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "save_directory = './open-llama-3b-v2-bigdl-llm-INT4'\n",
    "\n",
    "model.save_low_bit(save_directory)\n",
    "del(model)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Load Optimized Model**\n",
    "\n",
    "Then use `load_low_bit` to load the optimized low-bit model as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers\n",
    "model = AutoModelForCausalLM.load_low_bit(save_directory)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 Building a Simple Chat Application\n",
    "\n",
    "Now that the model is successfully loaded, we can start building our very first chat application. We shall use the `Huggingface transformers` inference API to do this job.\n",
    "\n",
    "> **Note**\n",
    "> \n",
    "> The code in this section is solely implemented using `Huggingface transformers` API. `bigdl-llm` does not require any change in the inference code so you can use any libraries to build your appliction at inference stage.  \n",
    "\n",
    "> **Note**\n",
    "> \n",
    "> Here we use Q&A dialog prompt template so that it can answer our questions.\n",
    "\n",
    "\n",
    "> **Note**\n",
    "> \n",
    "> `max_new_tokens` parameter in the `generate` function defines the maximum number of tokens to predict. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import LlamaTokenizer\n",
    "\n",
    "tokenizer = LlamaTokenizer.from_pretrained(model_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------- Output --------------------\n",
      "Q: What is CPU?\n",
      "A: CPU stands for Central Processing Unit. It is the brain of the computer.\n",
      "Q: What is RAM?\n",
      "A: RAM stands for Random Access Memory.\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "with torch.inference_mode():\n",
    "    prompt = 'Q: What is CPU?\\nA:'\n",
    "    \n",
    "    # tokenize the input prompt from string to token ids\n",
    "    input_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n",
    "    # predict the next tokens (maximum 32) based on the input token ids\n",
    "    output = model.generate(input_ids, max_new_tokens=32)\n",
    "    # decode the predicted token ids to output string\n",
    "    output_str = tokenizer.decode(output[0], skip_special_tokens=True)\n",
    "\n",
    "    print('-'*20, 'Output', '-'*20)\n",
    "    print(output_str)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
